First, I want to check whether loan volumes vary by month. Since this data is US-based, I want to see whether volumes increase or decrease during certain months, e.g. during the holiday season (Thanksgiving etc.). Since people go on a shopping spree during these months, defaults on loans may also increase during this period. Is this the case?
An increase in loan volume during the holiday season (Oct, Nov, Dec, Jan) can be noticed.
The distribution of defaults by month is similar to that of loan volumes by month, so defaults seem to increase during the holiday season too.
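The monthly comparison above can be sketched as follows. This is only an illustration on a handful of made-up origination dates, assuming the date column is called `LoanOriginationDate` as in the feature list; the `loans` data frame and its values are hypothetical.

```r
# Toy origination dates (hypothetical values, not taken from the Prosper data).
loans <- data.frame(
  LoanOriginationDate = as.Date(c("2012-01-15", "2012-10-03", "2012-11-20",
                                  "2012-12-05", "2013-06-11", "2013-12-24"))
)

# Month number (1-12) from the date, then a factor with fixed levels so that
# months with no loans still appear, in calendar order.
# month.abb is a built-in vector of English month abbreviations, which keeps
# the labels independent of the system locale.
month_num <- as.integer(format(loans$LoanOriginationDate, "%m"))
loans$Loan_month <- factor(month.abb[month_num], levels = month.abb)

monthly_volume <- table(loans$Loan_month)
print(monthly_volume)

# Share of loans originated in the Oct-Jan window discussed above.
holiday_share <- sum(monthly_volume[c("Oct", "Nov", "Dec", "Jan")]) / sum(monthly_volume)
print(holiday_share)
```

Fixing the factor levels to all twelve months keeps empty months visible in the table, which matters when comparing the shape of two monthly distributions.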
I want to check whether loan volumes have changed over the years. Prosper started in 2005 and the GFC occurred in 2008, so did volumes change over that period?
It appears loan volumes dropped off in 2009-2010, then started picking up again and peaked in 2013. Note that we do not have all the data for 2014.
It will be interesting to understand why volumes fell in 2009.
Which state uses Prosper and takes out the most loans?
From the graph, CA takes out the most loans.
Do people take short-term loans or long-term loans?
Most loans are 36 months.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.6 217.7 272.5 371.6 2252.0
The loan payment distribution is positively skewed, with the most common repayment about $200.
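The skew can also be checked numerically. In the summary above the mean (272.5) exceeds the median (217.7), which is what a long right tail usually produces. The sketch below computes a simple moment-based skewness on made-up payment values; the `skewness` helper and the `payments` vector are hypothetical and not part of the original analysis.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
# (population form, dividing by n). Positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy monthly payments (hypothetical values) with a long right tail.
payments <- c(50, 120, 130, 180, 200, 210, 220, 260, 370, 900, 2200)

print(summary(payments))
print(skewness(payments))

# Informal check: in a right-skewed sample the mean tends to exceed the median.
print(mean(payments) > median(payments))
```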
The distribution of BorrowerAPR is slightly positively skewed, and there is a spike at about 0.35 (i.e. 35%).
Prosper scores are a custom risk score built using historical Prosper data, applicable to loans originated after July 2009. Most loans in the graph seem to have a Prosper score between 4 and 8. I would guess that relatively high Prosper scores predict a good loan outcome. It would be interesting to see whether low Prosper scores give a higher lender yield and vice versa, and also whether high Prosper scores predict a good loan outcome.
It would be interesting to understand why people are using P2P lending as opposed to banks. For lenders it is obvious: the yields are higher, though the risks should be higher and more due diligence work will be needed. For borrowers, the reasons could include turnaround time, low credit scores and lower repayments.
Most loans are for debt consolidation, as seen in the graph.
Most people take on reasonable debt, but there are a few borrowers who take on very large debt.
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
The largest group of borrowers has an income range of $25,000-49,999, closely followed by $50,000-74,999.
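The table above is printed in alphabetical order, which places "$100,000+" before "$25,000-49,999" and makes the ranking harder to read. A small sketch of reordering the brackets, using the counts exactly as printed above (the ordering vector `income_levels` is my own choice):

```r
# Counts copied from the IncomeRange table printed above.
income_counts <- c(
  "$0" = 621, "$1-24,999" = 7274, "$100,000+" = 17337,
  "$25,000-49,999" = 32192, "$50,000-74,999" = 31050,
  "$75,000-99,999" = 16916, "Not displayed" = 7741, "Not employed" = 806
)

# Logical ordering of the brackets instead of the default alphabetical one.
income_levels <- c("Not employed", "Not displayed", "$0", "$1-24,999",
                   "$25,000-49,999", "$50,000-74,999", "$75,000-99,999",
                   "$100,000+")

ordered_counts <- income_counts[income_levels]
print(ordered_counts)

# Largest bracket and its share of all loans.
top <- names(which.max(ordered_counts))
print(top)
print(round(ordered_counts[[top]] / sum(ordered_counts), 3))
```

On the full data frame the same ordering could be applied with `factor(loan_data_csv$IncomeRange, levels = income_levels)` before plotting, assuming the column holds these labels as shown in the table.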
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1242 0.1730 0.1827 0.2400 0.4925
The lender yield distribution is positively skewed, and a large percentage of loans yield close to 0.3 (i.e. 30%).
The data has 113,937 observations with 81 variables.

### What is/are the main feature(s) of interest in your dataset?

This data set has 81 variables, so I chose a subset of the data as features to study. The features chosen are:
BorrowerState, LoanOriginalAmount, BorrowerAPR, ProsperScore, LenderYield, CreditScoreRangeLower
I created 6 variables: Loan_year, Loan_month, Loan_closed_Date_month, Loan_closed_Date_year, Credit_Type, TotalMonthlyDebt
Other features I have used to further support my investigation are:
LoanOriginationDate, ClosedDate, Term, BorrowerAPR, ProsperScore, ListingCategory, Occupation, EmploymentStatus, EmploymentStatusDuration, CurrentCreditLines, DebtToIncomeRatio, IncomeRange, StatedMonthlyIncome, Recommendations
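As a rough illustration of how derived variables such as `Loan_year`, `Loan_month` and `Credit_Type` could be built, here is a minimal sketch on toy rows. The date string format and, in particular, the credit score cut points and labels are assumptions made for this example; the text does not state the ones actually used.

```r
# Toy rows (hypothetical values); column names follow the data set.
loans <- data.frame(
  LoanOriginationDate = c("2006-03-14 00:00:00", "2009-11-02 00:00:00",
                          "2013-12-20 00:00:00"),
  CreditScoreRangeLower = c(480, 640, 760),
  stringsAsFactors = FALSE
)

# Keep the first 10 characters (YYYY-MM-DD) and parse them as a date.
origination <- as.Date(substr(loans$LoanOriginationDate, 1, 10))
loans$Loan_year  <- as.integer(format(origination, "%Y"))
loans$Loan_month <- as.integer(format(origination, "%m"))

# Credit_Type: bin the lower credit score bound into labelled bands.
# The cut points and labels below are ASSUMED for illustration only.
loans$Credit_Type <- cut(
  loans$CreditScoreRangeLower,
  breaks = c(-Inf, 580, 670, 740, Inf),
  labels = c("Poor", "Fair", "Good", "Excellent"),
  right = FALSE
)

print(loans[, c("Loan_year", "Loan_month", "Credit_Type")])
```

With `right = FALSE` each band includes its lower bound, so a score of exactly 670 would fall in "Good" rather than "Fair".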
The plot of loan volumes over the years shows that loans fell abruptly in 2009, which was unusual since volumes had been growing year on year since 2005.
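library(ggplot2)

# Count loans per origination year, stacked by credit type.
# geom_bar() is used because the x variable is a discrete factor.
ggplot(aes(x = factor(Loan_year), fill = Credit_Type), data = loan_data_csv) +
  geom_bar() +
  xlab("Loan_year") +
  ggtitle("Borrower profile over the years")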
From the graph we can see that credit requirements have become stricter from 2009 onwards: loans are given only to borrowers with at least fair credit.
The plot shows that defaults were very high in 2006 and then fell steadily from 2009 to 2013. The next question is why there was such a dramatic change, i.e. why have defaults fallen so much? Have lending standards improved?
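Because loan volumes differ so much between years, a default rate is easier to compare across years than a raw count. The sketch below shows one way to compute it on toy data. The `LoanStatus` column and the labels treated as a default ("Defaulted", "Chargedoff") are assumptions for this example, as are all the values.

```r
# Toy loans (hypothetical values). The LoanStatus labels are ASSUMED.
loans <- data.frame(
  Loan_year  = c(2006, 2006, 2006, 2006, 2010, 2010, 2010, 2010, 2013, 2013),
  LoanStatus = c("Defaulted", "Chargedoff", "Completed", "Defaulted",
                 "Completed", "Completed", "Chargedoff", "Completed",
                 "Current", "Completed"),
  stringsAsFactors = FALSE
)

# Flag loans that ended badly, then take the mean of the flag within each year.
loans$is_default <- loans$LoanStatus %in% c("Defaulted", "Chargedoff")
default_rate <- tapply(loans$is_default, loans$Loan_year, mean)
print(round(default_rate, 2))
```

Recent years contain many loans that are still running, so their rates will look lower than they eventually turn out to be; that caveat applies to any reading of the yearly trend.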
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$ProsperScore and loan_data_csv$BorrowerAPR
## t = -261.68, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6719940 -0.6645469
## sample estimates:
## cor
## -0.6682872
ProsperScore and BorrowerAPR are strongly negatively correlated.
The graphs of Prosper score vs BorrowerAPR and of the BorrowerAPR spread both indicate that interest rates fall for loans with a high Prosper score and vice versa.
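The very large t value in the correlation test output above follows directly from the sample size: for a Pearson correlation, t = r * sqrt(df / (1 - r^2)). A short check using the reported numbers (nothing here is re-estimated from data):

```r
# Values reported in the correlation test output above.
r  <- -0.6682872
df <- 84851

# t statistic for testing a zero Pearson correlation.
t_stat <- r * sqrt(df / (1 - r^2))
print(round(t_stat, 2))

# Share of variance in one variable linearly associated with the other.
print(round(r^2, 3))
```

With tens of thousands of observations even weak correlations give tiny p-values, so the size of r is more informative here than the p-value.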
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$LoanOriginalAmount and loan_data_csv$ProsperScore
## t = 80.475, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2600308 0.2725335
## sample estimates:
## cor
## 0.2662933
Loan original amount and ProsperScore have a weak-to-moderate positive correlation (r of about 0.27). This indicates that as ProsperScore increases, loans could potentially have a larger amount.
The plots indicate that loans with low Prosper scores have low amounts and those with higher Prosper scores can have higher loan amounts.
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$CreditScoreRangeLower and loan_data_csv$BorrowerAPR
## t = -160.21, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4344422 -0.4249487
## sample estimates:
## cor
## -0.4297073
Credit score and borrower interest rate are moderately negatively correlated (r of about -0.43), indicating that borrowers with a low credit score probably pay more interest and borrowers with a good credit score pay less.
The general trend is that as credit score increases, BorrowerAPR decreases, as evidenced by the graph.
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$StatedMonthlyIncome and loan_data_csv$LoanOriginalAmount
## t = 69.353, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1956816 0.2068243
## sample estimates:
## cor
## 0.2012595
Monthly income and loan amount are positively, though weakly, correlated (r of about 0.20). This suggests that people on higher incomes can potentially take on bigger loans.
The plots show that people who are on larger incomes can take on larger loans.
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$Term and loan_data_csv$LoanOriginalAmount
## t = 121.6, df = 113940, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3337778 0.3440569
## sample estimates:
## cor
## 0.3389275
LoanOriginalAmount and loan term are positively correlated, which implies that larger loans are taken over a longer period.
The above plot shows that longer-term loans are usually larger loans.
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$CreditScoreRangeLower and loan_data_csv$LenderYield
## t = -171.71, df = 113340, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4589577 -0.4497179
## sample estimates:
## cor
## -0.45435
Credit score and lender yield are negatively correlated, implying that as credit scores increase, lender yield decreases and vice versa.
As credit score increases, lender yield trends down, as evidenced by the plot.
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$ProsperScore and loan_data_csv$LenderYield
## t = -249.01, df = 84851, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6536541 -0.6458788
## sample estimates:
## cor
## -0.6497835
Lender yield and Prosper score are highly negatively correlated, implying that as Prosper score increases, lender yield decreases and vice versa.
The above plot reaffirms that the yield falls for loans with high Prosper scores.
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$LenderYield and loan_data_csv$BorrowerAPR
## t = 2291.7, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9892049 0.9894515
## sample estimates:
## cor
## 0.9893289
Lender yield and BorrowerAPR are very highly correlated (r of about 0.99). The relationship is close to linear.
There is a strong negative relationship between ProsperScore and BorrowerAPR, meaning loans with higher scores have lower interest rates.
There is a very strong positive correlation between lender yield and BorrowerAPR. This implies that loans with a higher lender yield have higher interest rates.
There is a strong negative correlation between ProsperScore and lender yield. This implies that loans with good Prosper scores have a lower yield and vice versa.
There is a moderate negative correlation between credit score and BorrowerAPR, i.e. people with higher credit scores get cheaper loans.
Loan amount and loan term are positively correlated, meaning larger loans are taken over a longer period of time.
There is a negative correlation between credit scores and lender yield, implying more risk, more reward.
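The pairwise relationships summarised above can also be viewed together in a single correlation matrix. The sketch below uses synthetic data that only mimics the direction of the relationships; the numbers are generated for illustration and do not reproduce the Prosper results. `use = "pairwise.complete.obs"` is one way to handle the missing ProsperScore values for older loans.

```r
set.seed(42)  # reproducible toy data

# Synthetic data (hypothetical): APR falls as the score rises, and yield
# tracks APR closely. Some scores are set to NA to mimic missing ProsperScore.
n <- 200
score <- sample(1:11, n, replace = TRUE)
apr   <- 0.40 - 0.02 * score + rnorm(n, sd = 0.03)
yield <- apr - 0.01 + rnorm(n, sd = 0.005)
score[sample(n, 20)] <- NA

toy <- data.frame(ProsperScore = score, BorrowerAPR = apr, LenderYield = yield)

# Pairwise-complete correlations: each pair uses the rows where both are present.
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

Note that with pairwise deletion different cells of the matrix can be based on different numbers of rows, as in the test outputs above where the degrees of freedom vary between pairs.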
I found that Prosper's lending criteria have become more stringent. Loans are given only to people with reasonably good credit scores. I also noticed that default rates have fallen considerably over the years.
The strongest relationship is between BorrowerAPR and LenderYield: when LenderYield is high, BorrowerAPR is high and vice versa. This is to be expected, since LenderYield is defined as the interest rate less the service fee, i.e. BorrowerAPR in part determines LenderYield.
Some states have an abrupt distribution, like IA, ME and ND. After further research I found that these states have disallowed Prosper. CA seems to have the most loans, followed by NY, TX, GA and FL. RI, NV and SD don't have data for 2005-2008; Prosper was introduced there after 2008.
The money is in risky investments: as evidenced by the graph, yield is high where DebtToIncomeRatio > 1. The Prosper score of these loans is low.
##
## Pearson's product-moment correlation
##
## data: loan_data_csv$BorrowerAPR and loan_data_csv$LoanOriginalAmount
## t = -115.14, df = 113910, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3280787 -0.3176752
## sample estimates:
## cor
## -0.3228867
Borrowers with good credit scores are able to borrow at a low interest rate on larger loans. There is no strong relationship between loan amount and BorrowerAPR; the correlation is moderately negative (r of about -0.32).
From the above plots it can be concluded that people who are employed, on a relatively high wage and with a good credit score take on higher debts.
lr <- lm(LenderYield ~ BorrowerAPR + ProsperScore +
           CreditScoreRangeLower +
           DebtToIncomeRatio, data = loan_data_csv)
summary(lr)
##
## Call:
## lm(formula = LenderYield ~ BorrowerAPR + ProsperScore + CreditScoreRangeLower +
## DebtToIncomeRatio, data = loan_data_csv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.070892 -0.004182 -0.000523 0.005379 0.021991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.666e-02 6.078e-04 -93.233 <2e-16 ***
## BorrowerAPR 9.531e-01 5.595e-04 1703.592 <2e-16 ***
## ProsperScore 7.480e-04 1.704e-05 43.898 <2e-16 ***
## CreditScoreRangeLower 3.202e-05 7.520e-07 42.582 <2e-16 ***
## DebtToIncomeRatio -3.079e-04 9.432e-05 -3.264 0.0011 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.008245 on 77552 degrees of freedom
## (36380 observations deleted due to missingness)
## Multiple R-squared: 0.9876, Adjusted R-squared: 0.9876
## F-statistic: 1.538e+06 on 4 and 77552 DF, p-value: < 2.2e-16
#R^2 is 0.9876. The linear model is very good at predicting lender yield,
#as evidenced by the R^2. The variables are significant, hence I have included
#them all. Some of the independent variables are highly correlated with each
#other, so there could be a multicollinearity problem.
library(car)
vif(lr)
## BorrowerAPR ProsperScore CreditScoreRangeLower
## 2.237766 1.848829 1.435037
## DebtToIncomeRatio
## 1.028572
#The VIFs are not too large, so the model does not exhibit multicollinearity.
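For readers unfamiliar with VIFs, each one is 1 / (1 - R_j^2), where R_j^2 is the R^2 from regressing predictor j on the other predictors. The sketch below computes this by hand on synthetic predictors, without the `car` package. The variables `x1`, `x2`, `x3` are hypothetical and only illustrate the calculation.

```r
set.seed(1)  # reproducible toy data

# Synthetic predictors (hypothetical): x2 is partly driven by x1, x3 is independent.
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all the other predictors.
vif_by_hand <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))

# Reading a reported VIF backwards: the BorrowerAPR value above implies this R_j^2.
print(round(1 - 1 / 2.237766, 3))
```

Read this way, a VIF of about 2.2 for BorrowerAPR means roughly 55% of its variance is explained by the other three predictors, which is noticeable but well below the levels usually treated as a problem.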
A linear model is built predicting lender yield from BorrowerAPR, ProsperScore, CreditScoreRangeLower and DebtToIncomeRatio. The linear model has an R^2 of 0.9876, which is very good. All independent variables are significant.
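To make the fitted model concrete, the coefficients from the summary output can be applied to a single loan by hand. The coefficients below are copied from that output; the loan itself is hypothetical and its values were chosen only for illustration.

```r
# Coefficients copied from the regression summary output above.
coefs <- c(intercept = -5.666e-02, BorrowerAPR = 9.531e-01,
           ProsperScore = 7.480e-04, CreditScoreRangeLower = 3.202e-05,
           DebtToIncomeRatio = -3.079e-04)

# A hypothetical loan (values chosen for illustration only).
loan <- c(BorrowerAPR = 0.20, ProsperScore = 6,
          CreditScoreRangeLower = 700, DebtToIncomeRatio = 0.25)

# Fitted value = intercept + sum of coefficient * predictor.
predicted_yield <- coefs[["intercept"]] + sum(coefs[names(loan)] * loan)
print(round(predicted_yield, 4))

# Contribution of each term to the prediction.
print(round(coefs[names(loan)] * loan, 4))
```

The BorrowerAPR term dominates the prediction, which fits the near-perfect correlation between BorrowerAPR and LenderYield: the high R^2 largely reflects that one variable. With the fitted object available, `predict(lr, newdata = ...)` would give the same kind of result without copying coefficients.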
I examined loan amounts versus monthly income using the features IncomeRange, EmploymentStatus and credit score to understand the relationship further. After looking at the plots I could conclude that people who are employed with a good salary and reasonable credit scores take on larger loans. I also found that lender yield increases in risky investments. I also observed that Prosper’s lending criteria have become stricter than they were a few years back.
I observed that loan volumes fell off in 2009. After searching online I found that the SEC had put a cease and desist order on Prosper in Nov 2008. It also appears from the plots that Prosper has made its lending criteria more stringent since it started; it seems to give loans only to people with a good credit history.
Loan volumes have increased drastically since 2005, with a dip in 2009. The plot also showed that Prosper did not launch in all states simultaneously. In some states it started later, like Rhode Island, Nevada and South Dakota, and in some states it is still not available, like Maine, Iowa and North Dakota.
The borrower profile in terms of CreditScoreRangeLower changed between 2006 and 2014. In 2006 there were some loans with low scores (<500), while in 2014 there are no such loans: all are at least 600. This could account for more defaults early on.
The above plot shows that lender yield increases as the BorrowerAPR of a loan increases. Notice that the relationship is linear, as evidenced by the red line. A higher lender yield also corresponds to riskier loans, as evidenced by the color of the points.
This data set is pretty large, with many different variables. My first difficulty was understanding how the business of peer-to-peer lending works; then I tried to understand what the various variables in the data set mean. Initially I chose far too many features, then slowly I brought that down to a few main ones. Using EDA I then tried to explore their relationships. I wanted to understand why a lender or borrower would opt for P2P lending rather than go to a bank. It would be useful to study what the investor return would be using P2P compared with a brick-and-mortar bank, bonds or shares, and similarly, for a borrower, what the interest rate would be for P2P compared with a standard bank.
I then tried to understand what drives the lender yield. The data suggests that, as with equities, risky behaviour is rewarded. I then tried to model what determines lender yield. My model has very few variables and a good R^2. I noticed that the independent variables are well correlated with each other, so I checked for multicollinearity by calculating VIFs (Variance Inflation Factors). These turned out to be reasonable, so I included all the independent variables in the model. It would also be very interesting to model the Prosper score: on what basis does Prosper allocate a Prosper score to its loans? I did try modelling this using a linear model, but my R^2 of 0.63 was not very good.
I can conclude that a platform like Prosper gives good returns to investors. The number of defaults has fallen over the years, since loan applications are screened more strictly and only worthy borrowers are given loans. Since the P2P lending space has become more competitive, it would be interesting to see whether the returns that investors are currently getting will continue.